{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hide\n",
    "from utils import *"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "[[chapter_arch_details]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Application architectures deep dive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now in the exciting position that we can fully understand the entire architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computer vision"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### cnn_learner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at what happens when we use the `cnn_learner` function. We pass it an architecture to use for the *body* of the network. Most of the time we use a resnet, which we already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the resnet.\n",
    "\n",
    "Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorisation. In fact, we do not only slice off this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to know where its body ends, and its head starts. We call this `model_meta` — here it is for resnet 50:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'cut': -2,\n",
       " 'split': <function fastai2.vision.learner._resnet_split(m)>,\n",
       " 'stats': ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_meta[resnet50]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> jargon: Body and Head: The \"head\" of a neural net is the part that is specialized for a particular task. For a convnet, it's generally the part after the adaptive average pooling layer. The \"body\" is everything else, and includes the \"stem\" (which we learned about in <<chapter_resnet>>)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we take all of the layers prior to the cutpoint of `-2`, we get the part of the model which fastai will keep for transfer learning. Now, we put on our new head. This is created using the function create_head:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sequential(\n",
       "  (0): AdaptiveConcatPool2d(\n",
       "    (ap): AdaptiveAvgPool2d(output_size=1)\n",
       "    (mp): AdaptiveMaxPool2d(output_size=1)\n",
       "  )\n",
       "  (1): full: False\n",
       "  (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
       "  (3): Dropout(p=0.25, inplace=False)\n",
       "  (4): Linear(in_features=20, out_features=512, bias=False)\n",
       "  (5): ReLU(inplace=True)\n",
       "  (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
       "  (7): Dropout(p=0.5, inplace=False)\n",
       "  (8): Linear(in_features=512, out_features=2, bias=False)\n",
       ")"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#hide_output\n",
    "create_head(20,2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "Sequential(\n",
    "  (0): AdaptiveConcatPool2d(\n",
    "    (ap): AdaptiveAvgPool2d(output_size=1)\n",
    "    (mp): AdaptiveMaxPool2d(output_size=1)\n",
    "  )\n",
    "  (1): Flatten()\n",
    "  (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)\n",
    "  (3): Dropout(p=0.25, inplace=False)\n",
    "  (4): Linear(in_features=20, out_features=512, bias=False)\n",
    "  (5): ReLU(inplace=True)\n",
    "  (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)\n",
    "  (7): Dropout(p=0.5, inplace=False)\n",
    "  (8): Linear(in_features=512, out_features=2, bias=False)\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and at other research labs in recent years, and tends to provide some small improvement over using just average pooling.\n",
    "\n",
    "Fastai is also a bit different to most libraries in adding two linear layers, rather than one, by default in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, and transferring two very different domains to the pretrained model. However, just using a single linear layer is unlikely to be enough. So we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> note: One parameter to create_head that is worth looking at is bn_final. Setting this to true will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model to more easily ensure that it is scaled appropriately for your output activations. We haven't seen this approach published anywhere, as yet, but we have found that it works well in practice, wherever we have used it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### unet_learner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks which share a similar basic design, such as increasing the resolution of an image (*super resolution*), adding colour to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)--these tasks are covered by an online chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image, and converting it to some other image of the same dimensions or aspect ratio, but with the pixels converted in some way. We refer to these as *generative vision models*.\n",
    "\n",
    "The way we do this is to start with the exact same approach to developing a CNN head as we saw above. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. And then we replace that with our custom head which does the generative task.\n",
    "\n",
    "There was a lot of handwaving in that last sentence! How on earth do we create a CNN head which generates an image? If we start with, say, a 224 pixel input image, then at the end of the resnet body we will have a 7x7 grid of convolutional activations. How can we convert that into a 224 pixel segmentation mask?\n",
    "\n",
    "We will (naturally) do this with a neural network! So we need some kind of layer which can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7x7 grid with four pixels in a 2x2 square. Each of those four pixels would have the same value — this is known as nearest neighbour interpolation. PyTorch provides a layer which does this for us, so we could create a head which contains stride one convolutional layers (along with batchnorm and ReLU as usual) interspersed with 2x2 nearest neighbour interpolation layers. In fact, you could try this now! See if you can create a custom head designed like this, and see if it can complete the CamVid segmentation task. You should find that you get some reasonable results, although it won't be as good as our <<chapter_intro>> results.\n",
    "\n",
    "Another approach is to replace the nearest neighbour and convolution combination with a *transposed convolution* otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between every pixel in the input. This is easiest to see with a picture — here's a diagram from the excellent convolutional arithmetic paper we have seen before, showing a 3x3 transposed convolution applied to a 3x3 image:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img alt=\"A transposed convolution\" width=\"815\" caption=\"A transposed convolution\" id=\"transp_conv\" src=\"images/att_00051.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you see, the result of this is to increase the size of the input. You can try this out now, by using fastai's ConvLayer class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.\n",
    "\n",
    "Neither of these approaches, however, works really well. The problem is that our 7x7 grid simply doesn't have enough information to create a 224x224 pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use skip connections, like in a resnet, but skipping from the activations in the body of the resnet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This is known as a U-Net, and it was developed in the 2015 paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). Although the paper focussed on medical applications, the U-Net has revolutionized all kinds of generation vision models.\n",
    "\n",
    "The U-Net paper shows the architecture like this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img alt=\"The U-net architecture\" width=\"630\" caption=\"The U-net architecture\" id=\"unet\" src=\"images/att_00052.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2x2 max pooling instead of stride 2 convolutions, since this paper was written before ResNets came along) and it shows the transposed convolutional layers on the right (they're called \"up-conv\" in this picture). Then then extra skip connections are shown as grey arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a \"U-net\" when you see this picture!\n",
    "\n",
    "With this architecture, the input to the transposed convolutions is not just the lower resolution grid in the preceding layer, but also the higher resolution grid in the resnet head. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class which auto-generates an architecture of the right size based on the data provided."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Natural language processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've seen how to create complete state of the art computer vision models, let's move on to NLP.\n",
    "\n",
    "Converting an AWD-LSTM language model into a transfer learning classifier follows a very similar process to what we saw for `cnn_learner` in the first section of this chapter. We do not need a \"meta\" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is to select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.\n",
    "\n",
    "To create a classifier from this we use an approach described in the ULMFiT paper as \"BPTT for Text Classification (BPT3C)\". The paper describes this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> In order to make fine-tuning a classifier for large documents feasible, we propose BPTT for Text Classification (BPT3C): We divide the document into fixed-length batches of size `b`. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, what this is saying is that the classifier contains a for loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models — but this time, we do not pool over CNN grid cells, but over RNN sequences.\n",
    "\n",
    "For this for loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own label. However, it's very likely that those texts won't have the good taste of being all of the same length, which means we won't be able to put them all in the same array, like we did with the language model.\n",
    "\n",
    "That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greater length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid having an extreme case where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation) we alter the randomness by making sure texts of comparable size are put together. It will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely random.\n",
    "\n",
    "This is done automatically behind the scenes by the fastai library when creating our `DataLoaders`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tabular"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use dot product, which we've implemented earlier from scratch.\n",
    "\n",
    "Here is the forward method for `TabularModel`:\n",
    "\n",
    "```python\n",
    "if self.n_emb != 0:\n",
    "    x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
    "    x = torch.cat(x, 1)\n",
    "    x = self.emb_drop(x)\n",
    "if self.n_cont != 0:\n",
    "    x_cont = self.bn_cont(x_cont)\n",
    "    x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
    "return self.layers(x)\n",
    "```\n",
    "\n",
    "We won't show `__init__` here, since it's not that interesting, but will look at each line of code in turn in `forward`:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "if self.n_emb != 0:\n",
    "```\n",
    "\n",
    "This is just testing whether there are any embeddings to deal with — we can skip this section if we only have continuous variables.\n",
    "\n",
    "```python\n",
    "    x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]\n",
    "```\n",
    "\n",
    "`self.embeds` contains the embedding matrices, so this gets the activations of each…\n",
    "\n",
    "```python\n",
    "    x = torch.cat(x, 1)\n",
    "```\n",
    "\n",
    "…and concatenates them into a single tensor.\n",
    "\n",
    "```python\n",
    "    x = self.emb_drop(x)\n",
    "```\n",
    "\n",
    "Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value.\n",
    "\n",
    "```python\n",
    "if self.n_cont != 0:\n",
    "```\n",
    "\n",
    "Now we test whether there are any continuous variables to deal with.\n",
    "\n",
    "```python\n",
    "    x_cont = self.bn_cont(x_cont)\n",
    "```\n",
    "\n",
    "They are passed through a batchnorm layer…\n",
    "\n",
    "```python\n",
    "    x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont\n",
    "```\n",
    "\n",
    "…and concatenated with the embedding activations, if there were any.\n",
    "\n",
    "```python\n",
    "return self.layers(x)\n",
    "\n",
    "```\n",
    "\n",
    "Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is True, and dropout, if `ps` is set to some value or list of values)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrapping up architectures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand why that is going on. Take a look at the papers that are being implemented in the code, and try to see how the code matches up to the algorithms that are described.\n",
    "\n",
    "Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. The reason that deep learning is not straightforward is because your data, memory, and time is limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.\n",
    "\n",
    "So step one is to get to the point that you can overfit. Then, the question is how to reduce that overfitting. Here is how we recommend prioritising the steps from there:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img alt=\"Steps to reducing over-fitting\" width=\"400\" caption=\"Steps to reducing over-fitting\" id=\"reduce_overfit\" src=\"images/att_00047.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many practitioners when faced with an overfitting model start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularisation. Using a smaller model should be absolutely the last step you take, unless your model is taking up too much time or memory. Reducing the size of your model as reducing the ability of your model to learn subtle relationships in your data.\n",
    "\n",
    "Instead, your first step should be to seek to create more data. That could involve adding more labels to data that you already have in your organisation, finding additional tasks that your model could be asked to solve (or to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data via using more or different data augmentation. Thanks to the development of mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.\n",
    "\n",
    "Once you've got as much data as you think you can reasonably get a hold of, and are using it as effectively as possible by taking advantage of all of the labels that you can find, and all of the augmentation that make sense, if you are still overfitting and you should think about using more generalisable architectures. For instance, adding batch normalisation may improve generalisation.\n",
    "\n",
    "If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularisation. Generally speaking, adding dropout to the last layer or two will do a good job of regularising your model. However, as we learnt from the story of the development of AWD-LSTM, it is often the case that adding dropout of different types throughout your model can help regularise even better. Generally speaking, a larger model with more regularisation is more flexible, and can therefore be more accurate, and a smaller model with less regularisation.\n",
    "\n",
    "Only after considering all of these options would be recommend that you try using smaller versions of your architectures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Questionnaire"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. What is the head of a neural net?\n",
    "1. What is the body of a neural net?\n",
    "1. What is \"cutting\" a neural net? Why do we need to do this for transfer learning?\n",
    "1. What is \"model_meta\"? Try printing it to see what's inside.\n",
    "1. Read the source code for `create_head` and make sure you understand what each line does.\n",
    "1. Look at the output of create_head and make sure you understand why each layer is there, and how the create_head source created it.\n",
    "1. Figure out how to change the dropout, layer size, and number of layers created by create_cnn, and see if you can find values that result in better accuracy from the pet recognizer.\n",
    "1. What does AdaptiveConcatPool2d do?\n",
    "1. What is nearest neighbor interpolation? How can it be used to upsample convolutional activations?\n",
    "1. What is a transposed convolution? What is another name for it?\n",
    "1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.\n",
    "1. Draw the u-net architecture.\n",
    "1. What is BPTT for Text Classification (BPT3C)?\n",
    "1. How do we handle different length sequences in BPT3C?\n",
    "1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.\n",
    "1. How is `self.layers` defined in `TabularModel`?\n",
    "1. What are the five steps for preventing over-fitting?\n",
    "1. Why don't we reduce architecture complexity before trying other approaches to preventing over-fitting?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further research"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.\n",
    "1. Try switching between AdaptiveConcatPool2d and AdaptiveAvgPool2d in a CNN head and see what difference it makes.\n",
    "1. Write your own custom splitter to create a separate parameter group for every resnet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.\n",
    "1. Read the online chapter about generative image models, and create your own colorizer, super resolution model, or style transfer model.\n",
    "1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on Camvid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "split_at_heading": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
